Visualisation Project for Airbnb Price Setting in Barcelona Team G

Rundong Liu, Farah Kaskas, Denise Eng, Vladislav Stanev

Project Objective

1. Identify the key factors that hosts should consider when setting prices in Barcelona

2. Develop relevant pricing recommendations for existing and new hosts

In [1]:
import plotly.express as px
import pandas as pd 
import numpy as np
import seaborn as sns
sns.set()
import matplotlib.pyplot as plt
from folium import plugins
import folium
import geopandas as gpd
from folium.plugins import FastMarkerCluster
from branca.colormap import LinearColormap
import csv

Because "listing_details.csv" has 104 columns and therefore contains too much noise, we created a file called "listing_valid.csv", which contains more sensible and better-organised data for us to analyse.

In [2]:
df = pd.read_csv('listing_valid.csv')
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20429 entries, 0 to 20428
Data columns (total 36 columns):
id                                20429 non-null object
name                              20414 non-null object
host_id                           20429 non-null object
host_name                         20412 non-null object
neighbourhood_group               20428 non-null object
neighbourhood                     20428 non-null object
latitude                          20429 non-null object
longitude                         20428 non-null float64
room_type                         20428 non-null object
price                             20428 non-null float64
minimum_nights                    20428 non-null float64
number_of_reviews                 20428 non-null float64
last_review                       16152 non-null object
reviews_per_month                 16152 non-null float64
calculated_host_listings_count    20428 non-null float64
availability_365                  20428 non-null float64
property_type                     20428 non-null object
accommodates                      20428 non-null float64
first_review                      16152 non-null object
review_scores_value               15935 non-null float64
review_scores_cleanliness         15939 non-null float64
review_scores_location            15934 non-null float64
review_scores_accuracy            15938 non-null float64
review_scores_communication       15943 non-null float64
review_scores_checkin             15932 non-null float64
review_scores_rating              15947 non-null float64
maximum_nights                    20428 non-null float64
listing_url                       20428 non-null object
host_is_superhost                 20411 non-null object
host_about                        12765 non-null object
host_response_time                17684 non-null object
host_response_rate                17684 non-null object
street                            20427 non-null object
weekly_price                      1213 non-null object
monthly_price                     1349 non-null object
market                            20411 non-null object
dtypes: float64(16), object(20)
memory usage: 5.6+ MB
//anaconda3/lib/python3.7/site-packages/IPython/core/interactiveshell.py:3057: DtypeWarning:

Columns (0,2,6,33,34) have mixed types. Specify dtype option on import or set low_memory=False.
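
The DtypeWarning above is the reason columns such as latitude are loaded as object rather than float64. A minimal sketch of one way to handle this, assuming the stray values are non-numeric strings: coerce the column with pd.to_numeric. The three-row CSV below is made up for illustration and is not the real listing_valid.csv.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for listing_valid.csv.
raw = io.StringIO(
    "id,latitude,price\n"
    "25786,41.40145,32.0\n"
    "34241,not_a_number,100.0\n"
    "35379,41.39036,40.0\n"
)

sample = pd.read_csv(raw)
# The stray string prevents the column from being parsed as float.
print(sample["latitude"].dtype)

# Coerce to float; values that cannot be parsed become NaN.
sample["latitude"] = pd.to_numeric(sample["latitude"], errors="coerce")
print(sample["latitude"].dtype)
print(sample["latitude"].isna().sum())
```

Coercing once, right after loading, would make the later astype(float) calls on latitude unnecessary.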

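1. Subset selection

We restrict the analysis to private rooms in apartments that accommodate two guests and are priced below 400 per night.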

In [3]:
df=df[df['accommodates']==2]
df=df[df['room_type']=='Private room']
df=df[df['property_type']=='Apartment']
df=df[df['price'] < 400]
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5585 entries, 3 to 20419
Data columns (total 36 columns):
id                                5585 non-null object
name                              5575 non-null object
host_id                           5585 non-null object
host_name                         5582 non-null object
neighbourhood_group               5585 non-null object
neighbourhood                     5585 non-null object
latitude                          5585 non-null object
longitude                         5585 non-null float64
room_type                         5585 non-null object
price                             5585 non-null float64
minimum_nights                    5585 non-null float64
number_of_reviews                 5585 non-null float64
last_review                       4785 non-null object
reviews_per_month                 4785 non-null float64
calculated_host_listings_count    5585 non-null float64
availability_365                  5585 non-null float64
property_type                     5585 non-null object
accommodates                      5585 non-null float64
first_review                      4785 non-null object
review_scores_value               4722 non-null float64
review_scores_cleanliness         4722 non-null float64
review_scores_location            4722 non-null float64
review_scores_accuracy            4721 non-null float64
review_scores_communication       4724 non-null float64
review_scores_checkin             4720 non-null float64
review_scores_rating              4724 non-null float64
maximum_nights                    5585 non-null float64
listing_url                       5585 non-null object
host_is_superhost                 5582 non-null object
host_about                        3079 non-null object
host_response_time                4366 non-null object
host_response_rate                4366 non-null object
street                            5584 non-null object
weekly_price                      337 non-null object
monthly_price                     300 non-null object
market                            5578 non-null object
dtypes: float64(16), object(20)
memory usage: 1.6+ MB
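
The four successive filters in the subset-selection cell can also be written as a single boolean mask. A sketch on a made-up frame (the rows are illustrative and not taken from the data set; only the column names match the notebook):

```python
import pandas as pd

# Hypothetical rows; column names match the notebook.
listings = pd.DataFrame({
    "accommodates": [2, 2, 4, 2],
    "room_type": ["Private room", "Entire home/apt", "Private room", "Private room"],
    "property_type": ["Apartment", "Apartment", "Apartment", "Apartment"],
    "price": [32.0, 120.0, 60.0, 450.0],
})

mask = (
    (listings["accommodates"] == 2)
    & (listings["room_type"] == "Private room")
    & (listings["property_type"] == "Apartment")
    & (listings["price"] < 400)
)
# .copy() gives an independent frame, so later column assignments are safe.
subset = listings[mask].copy()
print(len(subset))
```

Keeping the conditions in one expression makes the selection criteria visible at a glance.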

1.1 Encode the superhost as a dummy column and update the data set

In [4]:
df = df.merge(pd.get_dummies(df['host_is_superhost']), left_index = True, right_index = True)
df = df.drop(columns = ['f']).rename(columns = {'t':'super host'})
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5585 entries, 3 to 20419
Data columns (total 37 columns):
id                                5585 non-null object
name                              5575 non-null object
host_id                           5585 non-null object
host_name                         5582 non-null object
neighbourhood_group               5585 non-null object
neighbourhood                     5585 non-null object
latitude                          5585 non-null object
longitude                         5585 non-null float64
room_type                         5585 non-null object
price                             5585 non-null float64
minimum_nights                    5585 non-null float64
number_of_reviews                 5585 non-null float64
last_review                       4785 non-null object
reviews_per_month                 4785 non-null float64
calculated_host_listings_count    5585 non-null float64
availability_365                  5585 non-null float64
property_type                     5585 non-null object
accommodates                      5585 non-null float64
first_review                      4785 non-null object
review_scores_value               4722 non-null float64
review_scores_cleanliness         4722 non-null float64
review_scores_location            4722 non-null float64
review_scores_accuracy            4721 non-null float64
review_scores_communication       4724 non-null float64
review_scores_checkin             4720 non-null float64
review_scores_rating              4724 non-null float64
maximum_nights                    5585 non-null float64
listing_url                       5585 non-null object
host_is_superhost                 5582 non-null object
host_about                        3079 non-null object
host_response_time                4366 non-null object
host_response_rate                4366 non-null object
street                            5584 non-null object
weekly_price                      337 non-null object
monthly_price                     300 non-null object
market                            5578 non-null object
super host                        5585 non-null uint8
dtypes: float64(16), object(20), uint8(1)
memory usage: 1.6+ MB
In [5]:
df.head()
Out[5]:
id name host_id host_name neighbourhood_group neighbourhood latitude longitude room_type price ... listing_url host_is_superhost host_about host_response_time host_response_rate street weekly_price monthly_price market super host
3 25786 NICE ROOM AVAILABLE IN THE HEART OF GRACIA 108310 Pedro Gràcia la Vila de Gràcia 41.40145 2.15645 Private room 32.0 ... https://www.airbnb.com/rooms/25786 t Hola!\r\nas i say in my add i look for enthusi... within an hour 100% Barcelona, Barcelona, Spain NaN NaN Barcelona 1
8 34241 Private Double room - Plaza Real 73163 Andres Ciutat Vella el Barri Gòtic 41.37916 2.17535 Private room 100.0 ... https://www.airbnb.com/rooms/34241 f Hello I am a Professional designer, a traveler... within a few hours 100% Barcelona, CT, Spain $280.00 $700.00 Barcelona 0
10 35379 Double 02 CasanovaRooms Barcelona 152232 Pablo Eixample l'Antiga Esquerra de l'Eixample 41.39036 2.15274 Private room 40.0 ... https://www.airbnb.com/rooms/35379 t I was born and raised in Argentina and I moved... within an hour 100% Barcelona, Catalunya, Spain NaN NaN Barcelona 1
13 35392 Double 01 CasanovaRooms Barcelona 152232 Pablo Eixample l'Antiga Esquerra de l'Eixample 41.39082 2.15078 Private room 45.0 ... https://www.airbnb.com/rooms/35392 t I was born and raised in Argentina and I moved... within an hour 100% Barcelona, Catalunya, Spain NaN NaN Barcelona 1
17 49213 Descalzos en el parque, relaxing and cool room 208154 Cate Sants-Montjuïc el Poble Sec 41.37315 2.16640 Private room 35.0 ... https://www.airbnb.com/rooms/49213 f We are a couple Italian/chilean, middle age, a... within an hour 100% Barcelona, Catalonia, Spain NaN NaN Barcelona 0

5 rows × 37 columns
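
The dummy encoding in 1.1 can also be expressed as a direct comparison. A sketch, assuming host_is_superhost holds the strings 't' and 'f' with occasional missing values, as the info() output suggests. Like get_dummies with its default settings, it maps missing values to 0.

```python
import numpy as np
import pandas as pd

# Hypothetical values: 't' = superhost, 'f' = not, NaN = unknown.
hosts = pd.DataFrame({"host_is_superhost": ["t", "f", np.nan, "t"]})

# 1 only where the flag is exactly 't'; NaN compares as False and becomes 0.
hosts["super host"] = hosts["host_is_superhost"].eq("t").astype(int)
print(hosts["super host"].tolist())
```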

2. Average Price in different neighbourhoods

We found that Sarrià-Sant Gervasi and Ciutat Vella are the neighbourhood groups with the highest and second-highest average daily price, respectively.

In [6]:
geo = gpd.read_file("neighbourhoods.geojson")
mean_price = pd.DataFrame(df.groupby('neighbourhood_group')['price'].mean().sort_values(ascending=True)).reset_index()
In [7]:
geo.head()
Out[7]:
neighbourhood neighbourhood_group geometry
0 el Raval Ciutat Vella MULTIPOLYGON (((2.17739 41.37535, 2.17853 41.3...
1 el Barri Gòtic Ciutat Vella MULTIPOLYGON (((2.18288 41.38077, 2.18290 41.3...
2 la Dreta de l'Eixample Eixample MULTIPOLYGON (((2.17093 41.40185, 2.17333 41.4...
3 l'Antiga Esquerra de l'Eixample Eixample MULTIPOLYGON (((2.15972 41.38301, 2.15859 41.3...
4 la Nova Esquerra de l'Eixample Eixample MULTIPOLYGON (((2.14999 41.37562, 2.14983 41.3...
In [8]:
barca = pd.merge(geo,mean_price, on='neighbourhood_group', how="left")
In [9]:
barca.rename(columns={'price': 'average_price'}, inplace=True)
barca.average_price = barca.average_price.round(decimals=0)

barcelona_map = folium.Map(location=[41.38879, 2.15899], zoom_start=12)
accidents = plugins.MarkerCluster().add_to(barcelona_map)

color_scale = LinearColormap(['yellow','red'], vmin = 42, vmax=50)
map_dict = barca.set_index('neighbourhood_group')['average_price'].to_dict()
In [10]:
barca.head()
Out[10]:
neighbourhood neighbourhood_group geometry average_price
0 el Raval Ciutat Vella MULTIPOLYGON (((2.17739 41.37535, 2.17853 41.3... 53.0
1 el Barri Gòtic Ciutat Vella MULTIPOLYGON (((2.18288 41.38077, 2.18290 41.3... 53.0
2 la Dreta de l'Eixample Eixample MULTIPOLYGON (((2.17093 41.40185, 2.17333 41.4... 49.0
3 l'Antiga Esquerra de l'Eixample Eixample MULTIPOLYGON (((2.15972 41.38301, 2.15859 41.3... 49.0
4 la Nova Esquerra de l'Eixample Eixample MULTIPOLYGON (((2.14999 41.37562, 2.14983 41.3... 49.0
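
The cells above compute one mean price per neighbourhood group and attach it to every neighbourhood polygon with a left merge. A sketch of that pattern on made-up rows, using plain DataFrames in place of the GeoDataFrame (the prices are hypothetical):

```python
import pandas as pd

# Hypothetical listings and neighbourhood table.
listings = pd.DataFrame({
    "neighbourhood_group": ["Ciutat Vella", "Ciutat Vella", "Eixample"],
    "price": [50.0, 56.0, 49.0],
})
areas = pd.DataFrame({
    "neighbourhood": ["el Raval", "el Barri Gòtic", "la Dreta de l'Eixample", "Horta"],
    "neighbourhood_group": ["Ciutat Vella", "Ciutat Vella", "Eixample", "Horta-Guinardó"],
})

# One mean price per neighbourhood group.
mean_price = (
    listings.groupby("neighbourhood_group")["price"]
    .mean()
    .rename("average_price")
    .reset_index()
)
# how="left" keeps every neighbourhood, even if its group has no listings.
merged = areas.merge(mean_price, on="neighbourhood_group", how="left")
print(merged)
```

With a left merge, a group without any listing in the subset ends up with a missing average_price, which is worth checking before passing the values to a colour scale.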
In [11]:
features = df.columns.sort_values().tolist()

def get_color(feature):
    value = map_dict.get(feature['properties']['neighbourhood_group'])
    return color_scale(value)

folium.GeoJson(data=barca,
               name='Barcelona',
               tooltip=folium.features.GeoJsonTooltip(fields=['neighbourhood_group', 'average_price'],
                                                      labels=True,
                                                      sticky=False,
                                                        ),
               style_function= lambda feature: {
                   'fillColor': get_color(feature),
                   'color': 'black',
                   'weight': 1,
                   'dashArray': '5, 5',
                   'fillOpacity':0.5,
                   },
               highlight_function=lambda feature: {'weight':1, 'fillColor': get_color(feature), 'fillOpacity': 0.5}).add_to(barcelona_map)
//anaconda3/lib/python3.7/site-packages/pyproj/crs.py:77: FutureWarning:

'+init=<authority>:<code>' syntax is deprecated. '<authority>:<code>' is the preferred initialization method.

Out[11]:
<folium.features.GeoJson at 0x1124f7eb8>
In [12]:
import branca

colormap = branca.colormap.linear.YlOrRd_05.scale(36, 60)
colormap = colormap.to_step('bottomright',index=[36, 43, 49, 55])
colormap.caption = 'Average Airbnb price per night (per neighbourhood)'
colormap.add_to(barcelona_map)
# Attach the layer control to the map
folium.map.LayerControl('topright', collapsed=False).add_to(barcelona_map)

barcelona_map
Out[12]:

3. Price vs Review Ratings

There seems to be a correlation between review ratings (of all types) and the daily price. The next section shows how exactly they are correlated.

In [13]:
from plotly.subplots import make_subplots
import plotly.graph_objects as go


fig = make_subplots(rows=1, cols=5, shared_yaxes=True, vertical_spacing=0.02)

fig.add_trace(go.Scatter(x=df["review_scores_communication"], y=df["price"],mode='markers' ), row=1, col=1)
fig.add_trace(go.Scatter(x=df["review_scores_location"], y=df["price"], mode='markers'), row=1, col=2)
fig.add_trace(go.Scatter(x=df["review_scores_value"], y=df["price"], mode='markers'), row=1, col=3)
fig.add_trace(go.Scatter(x=df["review_scores_cleanliness"], y=df["price"], mode='markers'), row=1, col=4)
fig.add_trace(go.Scatter(x=df["review_scores_accuracy"], y=df["price"], mode='markers'), row=1, col=5)


# Update xaxis properties
fig.update_xaxes(title_text="communication rating", row=1, col=1)
fig.update_xaxes(title_text="location rating", row=1, col=2)
fig.update_xaxes(title_text="value rating", row=1, col=3)
fig.update_xaxes(title_text="cleanliness rating", row=1, col=4)
fig.update_xaxes(title_text="accuracy rating",  row=1, col=5)
fig.update_yaxes(title_text="price", row=1, col=1)

fig.update_layout(showlegend=False, title_text="Effect of Review Scores on Price")


fig.show()

4. Price vs a group of factors

4.1 Data Filtering

In [14]:
t4 = df.filter(items= ['id','neighbourhood','latitude','longitude','review_scores_rating','price','review_scores_checkin','review_scores_communication',\
                      'review_scores_accuracy','review_scores_location','review_scores_cleanliness','review_scores_value','neighbourhood_group',\
                      'host_is_superhost','number_of_reviews','super host'])
t4['latitude'] = t4['latitude'].astype(float)
t4.head()
Out[14]:
id neighbourhood latitude longitude review_scores_rating price review_scores_checkin review_scores_communication review_scores_accuracy review_scores_location review_scores_cleanliness review_scores_value neighbourhood_group host_is_superhost number_of_reviews super host
3 25786 la Vila de Gràcia 41.40145 2.15645 95.0 32.0 10.0 10.0 10.0 10.0 9.0 10.0 Gràcia t 268.0 1
8 34241 el Barri Gòtic 41.37916 2.17535 68.0 100.0 7.0 9.0 8.0 8.0 8.0 7.0 Ciutat Vella f 8.0 0
10 35379 l'Antiga Esquerra de l'Eixample 41.39036 2.15274 94.0 40.0 10.0 10.0 10.0 10.0 9.0 9.0 Eixample t 277.0 1
13 35392 l'Antiga Esquerra de l'Eixample 41.39082 2.15078 95.0 45.0 10.0 10.0 10.0 9.0 9.0 10.0 Eixample t 203.0 1
17 49213 el Poble Sec 41.37315 2.16640 99.0 35.0 10.0 10.0 10.0 10.0 10.0 10.0 Sants-Montjuïc f 48.0 0

4.2 Distance Calculating Function

We take the center of Barcelona to be (41.3851, 2.1734) and compute the great-circle (haversine) distance, in kilometres, from each neighbourhood's mean coordinates to the city center.

Algorithm reference: https://gist.github.com/rochacbruno/2883505

In [15]:
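import math
def distance(origin, destination):
    """Return the haversine distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1 = origin
    lat2, lon2 = destination
    radius = 6371 # mean Earth radius in km

    # differences in latitude and longitude, in radians
    dlat = math.radians(lat2-lat1)
    dlon = math.radians(lon2-lon1)
    # haversine term
    a = math.sin(dlat/2) * math.sin(dlat/2) + math.cos(math.radians(lat1)) \
        * math.cos(math.radians(lat2)) * math.sin(dlon/2) * math.sin(dlon/2)
    # central angle between the two points, in radians
    c = 2 * math.atan2(math.sqrt(a), math.sqrt(1-a))
    d = radius * c

    return d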
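
The function above handles one pair of points at a time, so the notebook calls it row by row. A sketch of a vectorised NumPy equivalent that computes the distance for a whole column at once; the name haversine_km and the CENTER constant are introduced here for illustration only.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0
CENTER = (41.3851, 2.1734)  # city center used in the notebook

def haversine_km(lat, lon, center=CENTER):
    """Haversine distance in km from each (lat, lon) in degrees to `center`."""
    lat = np.radians(np.asarray(lat, dtype=float))
    lon = np.radians(np.asarray(lon, dtype=float))
    lat0, lon0 = np.radians(center)
    dlat = lat - lat0
    dlon = lon - lon0
    a = np.sin(dlat / 2) ** 2 + np.cos(lat0) * np.cos(lat) * np.sin(dlon / 2) ** 2
    # arcsin(sqrt(a)) is equivalent to atan2(sqrt(a), sqrt(1 - a))
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# Coordinates of two Ciutat Vella listings that appear in this notebook.
d = haversine_km([41.37916, 41.38072], [2.17535, 2.17811])
print(np.round(d, 3))
```

On a DataFrame this would replace the iterrows loop with a single assignment such as frame['distance'] = haversine_km(frame['latitude'], frame['longitude']).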
In [16]:
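# numeric_only=True skips text columns, which recent pandas versions refuse to average
t6 = t4.groupby('neighbourhood').mean(numeric_only=True)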
for i,r in t6.iterrows():
    t6.at[i,'distance'] = distance((r['latitude'], r['longitude']), (41.3851, 2.1734))
temp = t6.merge(t4.filter(items = ['neighbourhood','neighbourhood_group']).set_index('neighbourhood'), left_index = True, right_index = True)
t7 = temp.drop_duplicates()
modify = {"review_scores_rating": "review rating", "review_scores_checkin": "checkin rating",\
          "review_scores_communication": "communication rating",\
          "review_scores_accuracy":'accuracy rating',\
          'review_scores_location':'location rating',\
          'review_scores_cleanliness': 'cleanliness rating',\
          'review_scores_value': 'value rating',
          'number_of_reviews': 'number of reviews',
          'neighbourhood_group':'neighbourhood group' }
t7 = t7.rename(columns = modify).reset_index()
t7.head()
Out[16]:
neighbourhood latitude longitude review rating price checkin rating communication rating accuracy rating location rating cleanliness rating value rating number of reviews super host distance neighbourhood group
0 Can Baró 41.416054 2.162256 93.428571 39.857143 9.857143 9.857143 9.571429 9.285714 9.428571 9.571429 21.285714 0.142857 3.565262 Horta-Guinardó
1 Can Peguera 41.435820 2.167310 78.000000 30.000000 9.000000 8.000000 8.000000 7.000000 9.000000 8.000000 8.000000 0.000000 5.662628 Nou Barris
2 Canyelles 41.445140 2.168280 100.000000 38.000000 10.000000 10.000000 8.000000 10.000000 10.000000 10.000000 2.000000 0.000000 6.689782 Nou Barris
3 Ciutat Meridiana 41.462074 2.177366 92.500000 35.200000 9.750000 9.750000 9.250000 8.500000 9.250000 9.000000 7.400000 0.000000 8.565504 Nou Barris
4 Diagonal Mar i el Front Marítim del Poblenou 41.406300 2.211650 91.567568 46.434783 9.540541 9.486486 9.459459 9.540541 9.297297 9.081081 19.739130 0.326087 3.967030 Sant Martí
In [17]:
t7.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 70 entries, 0 to 69
Data columns (total 15 columns):
neighbourhood           70 non-null object
latitude                70 non-null float64
longitude               70 non-null float64
review rating           70 non-null float64
price                   70 non-null float64
checkin rating          70 non-null float64
communication rating    70 non-null float64
accuracy rating         70 non-null float64
location rating         70 non-null float64
cleanliness rating      70 non-null float64
value rating            70 non-null float64
number of reviews       70 non-null float64
super host              70 non-null float64
distance                70 non-null float64
neighbourhood group     70 non-null object
dtypes: float64(13), object(2)
memory usage: 8.3+ KB

We have 70 neighbourhoods, each with its average ratings, price, distance, number of reviews, etc. We now plot the correlation table as a heatmap.

In [18]:
correlation_table = t7.filter(items = ['price','checkin rating','communication rating',\
                  'accuracy rating','location rating','cleanliness rating','value rating','number of reviews','distance','super host']).corr()
correlation_table
Out[18]:
price checkin rating communication rating accuracy rating location rating cleanliness rating value rating number of reviews distance super host
price 1.000000 0.085911 0.232144 0.244054 0.606166 0.080793 0.090149 0.323175 -0.619853 0.266349
checkin rating 0.085911 1.000000 0.781702 0.483725 0.548672 0.686387 0.712913 -0.136598 -0.071835 0.221545
communication rating 0.232144 0.781702 1.000000 0.524125 0.677372 0.520598 0.715297 0.043770 -0.154980 0.319455
accuracy rating 0.244054 0.483725 0.524125 1.000000 0.428834 0.353028 0.501723 0.125949 -0.217096 0.187341
location rating 0.606166 0.548672 0.677372 0.428834 1.000000 0.440007 0.530614 0.385456 -0.639032 0.257509
cleanliness rating 0.080793 0.686387 0.520598 0.353028 0.440007 1.000000 0.765493 -0.049804 -0.051353 0.095163
value rating 0.090149 0.712913 0.715297 0.501723 0.530614 0.765493 1.000000 -0.114950 -0.034398 0.115932
number of reviews 0.323175 -0.136598 0.043770 0.125949 0.385456 -0.049804 -0.114950 1.000000 -0.570229 0.113043
distance -0.619853 -0.071835 -0.154980 -0.217096 -0.639032 -0.051353 -0.034398 -0.570229 1.000000 -0.147897
super host 0.266349 0.221545 0.319455 0.187341 0.257509 0.095163 0.115932 0.113043 -0.147897 1.000000
In [19]:
import plotly.figure_factory as ff
z=correlation_table.values.tolist()
z_text = np.around(z, decimals=3)
fig = ff.create_annotated_heatmap(
                   z=correlation_table.values.tolist(),
                   x=list(correlation_table.columns),
                   y=list(correlation_table.index),
                   colorscale='Viridis',
    annotation_text=z_text
                   )
fig.show()
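
Each entry in the correlation table is a Pearson coefficient, which is the default method of DataFrame.corr(). A sketch of the computation for one pair of columns, using only the first five neighbourhood rows of t7 shown earlier, so the result will not reproduce the value obtained from all 70 neighbourhoods. The helper name pearson is introduced here for illustration.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

# First five rows of t7.head() above: distance in km and mean price.
distance_km = [3.565262, 5.662628, 6.689782, 8.565504, 3.967030]
price = [39.857143, 30.0, 38.0, 35.2, 46.434783]

r = pearson(distance_km, price)
print(round(r, 3))
```

A negative value means that price tends to fall as distance grows; a value near zero means no linear relationship.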

5. Price vs Distance from the center of Barcelona (41.3851, 2.1734)

5.1 Data Filtering

Because of the long-tailed distribution of the number of reviews, we keep only listings with at least 4 reviews.

In [20]:
fig = px.histogram(df, x="number_of_reviews")
fig.show()
In [21]:
df2 = df[df['number_of_reviews']>=4]
In [22]:
t5 = df2.filter(items= ['id','neighbourhood','latitude','longitude','review_scores_rating','price','review_scores_checkin','review_scores_communication',\
                      'review_scores_accuracy','review_scores_location','review_scores_cleanliness','review_scores_value','neighbourhood_group',\
                      'host_is_superhost','number_of_reviews','super host'])
t5['latitude'] = t5['latitude'].astype(float)
t5.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 3945 entries, 3 to 19567
Data columns (total 16 columns):
id                             3945 non-null object
neighbourhood                  3945 non-null object
latitude                       3945 non-null float64
longitude                      3945 non-null float64
review_scores_rating           3944 non-null float64
price                          3945 non-null float64
review_scores_checkin          3944 non-null float64
review_scores_communication    3944 non-null float64
review_scores_accuracy         3944 non-null float64
review_scores_location         3944 non-null float64
review_scores_cleanliness      3944 non-null float64
review_scores_value            3944 non-null float64
neighbourhood_group            3945 non-null object
host_is_superhost              3943 non-null object
number_of_reviews              3945 non-null float64
super host                     3945 non-null uint8
dtypes: float64(11), object(4), uint8(1)
memory usage: 497.0+ KB
In [23]:
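# numeric_only=True skips text columns, which recent pandas versions refuse to average
t6 = t5.groupby('neighbourhood').mean(numeric_only=True)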
for i,r in t6.iterrows():
    t6.at[i,'distance'] = distance((r['latitude'], r['longitude']), (41.3851, 2.1734))
t6.info()
<class 'pandas.core.frame.DataFrame'>
Index: 69 entries, Can Baró to les Tres Torres
Data columns (total 13 columns):
latitude                       69 non-null float64
longitude                      69 non-null float64
review_scores_rating           69 non-null float64
price                          69 non-null float64
review_scores_checkin          69 non-null float64
review_scores_communication    69 non-null float64
review_scores_accuracy         69 non-null float64
review_scores_location         69 non-null float64
review_scores_cleanliness      69 non-null float64
review_scores_value            69 non-null float64
number_of_reviews              69 non-null float64
super host                     69 non-null float64
distance                       69 non-null float64
dtypes: float64(13)
memory usage: 10.0+ KB

As we can see, keeping only listings with at least 4 reviews removes one neighbourhood, so we now have 69.
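
The notebook does not name the neighbourhood that drops out; the head() outputs suggest it is Canyelles, whose listings average 2 reviews. A sketch of how this could be checked with a set difference, on made-up rows (the review counts are hypothetical):

```python
import pandas as pd

# Hypothetical listings; 'Canyelles' only has a listing with few reviews.
listings = pd.DataFrame({
    "neighbourhood": ["el Raval", "el Raval", "Canyelles", "Horta"],
    "number_of_reviews": [201, 2, 2, 24],
})

kept = listings[listings["number_of_reviews"] >= 4]
# Neighbourhoods present before the filter but absent after it.
dropped = set(listings["neighbourhood"]) - set(kept["neighbourhood"])
print(dropped)
```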

In [24]:
temp = t6.merge(t5.filter(items = ['neighbourhood','neighbourhood_group']).set_index('neighbourhood'), left_index = True, right_index = True)
t7 = temp.drop_duplicates()
modify = {"review_scores_rating": "review rating", "review_scores_checkin": "checkin rating",\
          "review_scores_communication": "communication rating",\
          "review_scores_accuracy":'accuracy rating',\
          'review_scores_location':'location rating',\
          'review_scores_cleanliness': 'cleanliness rating',\
          'review_scores_value': 'value rating',
          'number_of_reviews': 'number of reviews',
          'neighbourhood_group':'neighbourhood group' }
t7 = t7.rename(columns = modify).reset_index()
t7.head()
Out[24]:
neighbourhood latitude longitude review rating price checkin rating communication rating accuracy rating location rating cleanliness rating value rating number of reviews super host distance neighbourhood group
0 Can Baró 41.416122 2.162038 94.000000 34.000000 9.833333 9.833333 9.500000 9.166667 9.333333 9.500000 24.500000 0.166667 3.577256 Horta-Guinardó
1 Can Peguera 41.435820 2.167310 78.000000 30.000000 9.000000 8.000000 8.000000 7.000000 9.000000 8.000000 8.000000 0.000000 5.662628 Nou Barris
2 Ciutat Meridiana 41.462745 2.179175 85.000000 25.000000 9.500000 9.500000 9.500000 8.000000 8.500000 8.000000 17.500000 0.000000 8.647147 Nou Barris
3 Diagonal Mar i el Front Marítim del Poblenou 41.406055 2.211833 93.827586 42.206897 9.482759 9.517241 9.620690 9.586207 9.482759 9.310345 30.758621 0.448276 3.963205 Sant Martí
4 Horta 41.434608 2.158890 92.666667 34.000000 9.833333 9.833333 9.666667 8.833333 9.333333 9.333333 24.000000 0.000000 5.636500 Horta-Guinardó

5.2 Average price per day decreases as we go further from the city center, and so does the location rating

In [25]:
t7['size'] = [10]*len(t7)
fig2 = px.scatter(t7, x="distance", y="price", color="location rating", size = 'size',trendline='ols',\
                  hover_name="neighbourhood", hover_data=["neighbourhood group"],\
                 color_continuous_scale=px.colors.sequential.Viridis)
fig2.show()
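
The trendline='ols' option draws an ordinary least-squares line (plotly uses statsmodels for this, so that package needs to be installed). A sketch of the same kind of fit with np.polyfit, using only the first five rows of the filtered t7 table shown earlier, so the coefficients are illustrative and not those of the plotted trendline.

```python
import numpy as np

# First five rows of the filtered t7.head() above: distance in km and mean price.
distance_km = np.array([3.577256, 5.662628, 8.647147, 3.963205, 5.636500])
price = np.array([34.0, 30.0, 25.0, 42.206897, 34.0])

# Degree-1 least-squares fit: price ≈ slope * distance + intercept.
slope, intercept = np.polyfit(distance_km, price, deg=1)
print(f"slope={slope:.2f} per km, intercept={intercept:.2f}")
```

The slope reads as the change in average price for each additional kilometre from the center.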

6. Dive into Ciutat Vella

6.1 Rename the columns of our original dataset and focus on Ciutat Vella

In [26]:
df_new = df.filter(items= ['id','neighbourhood','latitude','longitude','price','review_scores_rating','review_scores_checkin','review_scores_communication',\
       'review_scores_accuracy','review_scores_location','review_scores_cleanliness','review_scores_value','neighbourhood_group','super host','number_of_reviews'])
modify = {"review_scores_rating": "review rating", "review_scores_checkin": "checkin rating",\
          "review_scores_communication": "communication rating",\
          "review_scores_accuracy":'accuracy rating',\
          'review_scores_location':'location rating',\
          'review_scores_cleanliness': 'cleanliness rating',\
          'review_scores_value': 'value rating',\
          'number_of_reviews': 'number of reviews',\
          'neighbourhood_group': 'neighbourhood group'}
df_new = df_new.rename(columns = modify)
df_new.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5585 entries, 3 to 20419
Data columns (total 15 columns):
id                      5585 non-null object
neighbourhood           5585 non-null object
latitude                5585 non-null object
longitude               5585 non-null float64
price                   5585 non-null float64
review rating           4724 non-null float64
checkin rating          4720 non-null float64
communication rating    4724 non-null float64
accuracy rating         4721 non-null float64
location rating         4722 non-null float64
cleanliness rating      4722 non-null float64
value rating            4722 non-null float64
neighbourhood group     5585 non-null object
super host              5585 non-null uint8
number of reviews       5585 non-null float64
dtypes: float64(10), object(4), uint8(1)
memory usage: 659.9+ KB
In [27]:
cv = df_new.where(df_new['neighbourhood group'] == 'Ciutat Vella').dropna(how = 'all')
cv['latitude'] = cv['latitude'].astype(float)
cv.head()
Out[27]:
id neighbourhood latitude longitude price review rating checkin rating communication rating accuracy rating location rating cleanliness rating value rating neighbourhood group super host number of reviews
8 34241 el Barri Gòtic 41.37916 2.17535 100.0 68.0 7.0 9.0 8.0 8.0 8.0 7.0 Ciutat Vella 0.0 8.0
23 68547 el Barri Gòtic 41.38072 2.17811 49.0 95.0 10.0 10.0 9.0 10.0 9.0 9.0 Ciutat Vella 0.0 89.0
36 74562 el Raval 41.38255 2.16836 60.0 96.0 10.0 10.0 10.0 10.0 10.0 10.0 Ciutat Vella 1.0 201.0
43 95719 Sant Pere, Santa Caterina i la Ribera 41.38390 2.18011 53.0 94.0 10.0 10.0 10.0 10.0 10.0 9.0 Ciutat Vella 1.0 230.0
54 119546 el Raval 41.37969 2.16532 55.0 97.0 10.0 10.0 10.0 10.0 10.0 10.0 Ciutat Vella 1.0 311.0
In [28]:
cv = cv.dropna()
cv.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1384 entries, 8 to 20136
Data columns (total 15 columns):
id                      1384 non-null object
neighbourhood           1384 non-null object
latitude                1384 non-null float64
longitude               1384 non-null float64
price                   1384 non-null float64
review rating           1384 non-null float64
checkin rating          1384 non-null float64
communication rating    1384 non-null float64
accuracy rating         1384 non-null float64
location rating         1384 non-null float64
cleanliness rating      1384 non-null float64
value rating            1384 non-null float64
neighbourhood group     1384 non-null object
super host              1384 non-null float64
number of reviews       1384 non-null float64
dtypes: float64(12), object(3)
memory usage: 173.0+ KB
In [29]:
cv['price'].hist()
print(cv['price'].mean())
52.42846820809248

6.2 Add distance column

In [30]:
for i,r in cv.iterrows():
    cv.at[i,'distance'] = distance((r['latitude'], r['longitude']), (41.3851, 2.1734))
cv.head()
Out[30]:
id neighbourhood latitude longitude price review rating checkin rating communication rating accuracy rating location rating cleanliness rating value rating neighbourhood group super host number of reviews distance
8 34241 el Barri Gòtic 41.37916 2.17535 100.0 68.0 7.0 9.0 8.0 8.0 8.0 7.0 Ciutat Vella 0.0 8.0 0.680240
23 68547 el Barri Gòtic 41.38072 2.17811 49.0 95.0 10.0 10.0 9.0 10.0 9.0 9.0 Ciutat Vella 0.0 89.0 0.625794
36 74562 el Raval 41.38255 2.16836 60.0 96.0 10.0 10.0 10.0 10.0 10.0 10.0 Ciutat Vella 1.0 201.0 0.507154
43 95719 Sant Pere, Santa Caterina i la Ribera 41.38390 2.18011 53.0 94.0 10.0 10.0 10.0 10.0 10.0 9.0 Ciutat Vella 1.0 230.0 0.575488
54 119546 el Raval 41.37969 2.16532 55.0 97.0 10.0 10.0 10.0 10.0 10.0 10.0 Ciutat Vella 1.0 311.0 0.903506

6.3 Correlation Heatmap

1. We found that superhost status, overall review rating and number of reviews are important factors for price in Ciutat Vella

2. It makes sense that distance and location rating are not really important in this area: the whole area is close to the coast, so being slightly further from the city center does not affect the price much

In [31]:
import plotly.figure_factory as ff
# Compute the correlation matrix once, on the numeric columns from 'price' onwards
corr_cv = cv[cv.columns[4:]].select_dtypes(include='number').corr()
z = corr_cv.values.tolist()
z_text = np.around(z, decimals=3)
fig = ff.create_annotated_heatmap(z=z,
                   x=list(corr_cv.columns),
                   y=list(corr_cv.index), annotation_text=z_text, colorscale='Viridis')


fig.show()
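
Reading the strongest factors off a heatmap is easier with the price column sorted by absolute correlation. A sketch on a made-up frame (the values are hypothetical; the column names follow the renamed cv frame):

```python
import pandas as pd

# Hypothetical listings.
toy = pd.DataFrame({
    "price": [40.0, 55.0, 60.0, 48.0, 70.0],
    "review rating": [90.0, 95.0, 97.0, 92.0, 99.0],
    "distance": [0.9, 0.5, 0.6, 0.7, 0.4],
    "neighbourhood group": ["Ciutat Vella"] * 5,
})

# Keep numeric columns only, then correlate.
corr = toy.select_dtypes(include="number").corr()
# Correlation of every other numeric column with price, strongest first.
ranked = corr["price"].drop("price").sort_values(key=abs, ascending=False)
print(ranked)
```

The same two lines applied to cv or ss would list the factors in the order discussed in the text.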

7. Dive into Sarrià-Sant Gervasi

7.1 Focus on Sarrià-Sant Gervasi

In [32]:
ss = df_new.where(df_new['neighbourhood group'] == 'Sarrià-Sant Gervasi').dropna(how = 'all')
ss['latitude'] = ss['latitude'].astype(float)
for i,r in ss.iterrows():
    ss.at[i,'distance'] = distance((r['latitude'], r['longitude']), (41.3851, 2.1734))
ss.head()
Out[32]:
id neighbourhood latitude longitude price review rating checkin rating communication rating accuracy rating location rating cleanliness rating value rating neighbourhood group super host number of reviews distance
294 528979 Sant Gervasi - Galvany 41.39760 2.15404 45.0 95.0 10.0 10.0 10.0 10.0 10.0 10.0 Sarrià-Sant Gervasi 0.0 90.0 2.130766
736 985310 Sant Gervasi - Galvany 41.39863 2.14888 65.0 97.0 10.0 10.0 10.0 10.0 10.0 9.0 Sarrià-Sant Gervasi 1.0 35.0 2.539138
985 1251176 Sant Gervasi - Galvany 41.39982 2.14289 60.0 100.0 10.0 10.0 10.0 9.0 10.0 10.0 Sarrià-Sant Gervasi 0.0 2.0 3.025981
1019 1273077 Sarrià 41.39749 2.12133 45.0 97.0 10.0 10.0 10.0 10.0 10.0 10.0 Sarrià-Sant Gervasi 0.0 42.0 4.556917
1261 1667342 Sant Gervasi - Galvany 41.39340 2.13909 92.0 98.0 10.0 10.0 10.0 10.0 10.0 10.0 Sarrià-Sant Gervasi 1.0 173.0 3.007338
In [33]:
ss = ss.dropna()
ss.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 125 entries, 294 to 19582
Data columns (total 16 columns):
id                      125 non-null object
neighbourhood           125 non-null object
latitude                125 non-null float64
longitude               125 non-null float64
price                   125 non-null float64
review rating           125 non-null float64
checkin rating          125 non-null float64
communication rating    125 non-null float64
accuracy rating         125 non-null float64
location rating         125 non-null float64
cleanliness rating      125 non-null float64
value rating            125 non-null float64
neighbourhood group     125 non-null object
super host              125 non-null float64
number of reviews       125 non-null float64
distance                125 non-null float64
dtypes: float64(13), object(3)
memory usage: 16.6+ KB
In [34]:
ss['price'].hist()
print(ss['price'].mean())
54.72

Price in Sarrià-Sant Gervasi is slightly higher on average (54.72, compared with 52.43 in Ciutat Vella)

7.2 Correlation Heatmap

As Sarrià-Sant Gervasi is far from the city center, distance is a more important factor in this area than in Ciutat Vella

As most of Sarrià-Sant Gervasi lies on the mountain, location matters more, which is why the location rating is much more strongly correlated with price

Communication rating and check-in rating are also important factors

In [35]:
import plotly.figure_factory as ff
# Compute the correlation matrix once, on the numeric columns from 'price' onwards
corr_ss = ss[ss.columns[4:]].select_dtypes(include='number').corr()
z = corr_ss.values.tolist()
z_text = np.around(z, decimals=3)
fig = ff.create_annotated_heatmap(
                   z=z,
                   x=list(corr_ss.columns),
                   y=list(corr_ss.index),annotation_text=z_text, colorscale='Viridis'
                   )
fig.show()